RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
In this paper, we introduce the task of visual grounding for remote sensing
data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS)
images with the guidance of natural language. To retrieve rich information from
RS imagery using natural language, many research tasks, such as RS image visual
question answering, RS image captioning, and RS image-text retrieval, have been
extensively investigated. However, object-level visual grounding on RS images is
still under-explored. Thus, in this work, we propose to construct the dataset
and explore deep learning models for the RSVG task. Specifically, our
contributions can be summarized as follows. 1) We build a new large-scale
benchmark dataset for RSVG, termed RSVGD, to fully advance research on RSVG.
This new dataset includes image/expression/box triplets for training and
evaluating visual grounding models. 2) We extensively benchmark
state-of-the-art (SOTA) natural image visual grounding methods on the
constructed RSVGD dataset and provide insightful analyses based on the
results. 3) A novel
transformer-based Multi-Level Cross-Modal feature learning (MLCM) module is
proposed. Remotely sensed images usually exhibit large scale variations and
cluttered backgrounds. To deal with the scale-variation problem, the MLCM
module takes advantage of multi-scale visual features and multi-granularity
textual embeddings to learn more discriminative representations. To cope with
the cluttered background problem, MLCM adaptively filters irrelevant noise and
enhances salient features. In this way, our proposed model can incorporate more
effective multi-level and multi-modal features to boost performance.
Furthermore, this work also provides useful insights for developing better RSVG
models. The dataset and code will be publicly available at
https://github.com/ZhanYang-nwpu/RSVG-pytorch.
Comment: 12 pages, 10 figures
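As an illustrative sketch of the kind of cross-modal filtering the MLCM module performs (not the paper's actual implementation), text-conditioned attention can down-weight visual features that are irrelevant to the query; the function names and toy embeddings below are our own assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(visual_feats, text_query):
    """Weight multi-scale visual features by their similarity to the
    textual query, suppressing irrelevant (low-scoring) regions, and
    return the attended feature plus the attention weights."""
    scores = [dot(v, text_query) for v in visual_feats]
    weights = softmax(scores)
    dim = len(visual_feats[0])
    fused = [sum(w * v[d] for w, v in zip(weights, visual_feats))
             for d in range(dim)]
    return fused, weights

# toy example: two visual features at different scales, one text embedding
visual = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]  # the query aligns with the first visual feature
fused, weights = cross_modal_attention(visual, query)
```

The feature matching the query receives the larger attention weight, so the fused representation leans toward the query-relevant content, which is the intuition behind filtering cluttered backgrounds.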
Change Detection Meets Visual Question Answering
The Earth's surface is continually changing, and identifying changes plays an
important role in urban planning and sustainability. Although change detection
techniques have been successfully developed for many years, these techniques
are still limited to experts and facilitators in related fields. In order to
provide every user with flexible access to change information and help them
better understand land-cover changes, we introduce a novel task: change
detection-based visual question answering (CDVQA) on multi-temporal aerial
images. In particular, multi-temporal images can be queried to obtain
high-level change-based information according to content changes between two input
images. We first build a CDVQA dataset including multi-temporal
image-question-answer triplets using an automatic question-answer generation
method. Then, a baseline CDVQA framework is devised in this work, and it
contains four parts: multi-temporal feature encoding, multi-temporal fusion,
multi-modal fusion, and answer prediction. In addition, we also introduce a
change enhancing module to multi-temporal feature encoding, aiming at
incorporating more change-related information. Finally, the effects of
different backbones and multi-temporal fusion strategies on the performance of
the CDVQA task are studied. The experimental results provide useful insights
for developing better CDVQA models, which are important for future research on
this task. We will make our dataset and code publicly available.
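The abstract does not detail the automatic question-answer generation method, but a minimal sketch of how QA pairs might be derived from two co-registered land-cover label grids (the question templates and ratio thresholds below are our own assumptions) could look like:

```python
def generate_qa(map_t1, map_t2):
    """Generate simple change-based question-answer pairs from two
    co-registered land-cover label grids (lists of equal-length rows)."""
    total = sum(len(row) for row in map_t1)
    changed = sum(1 for r1, r2 in zip(map_t1, map_t2)
                  for c1, c2 in zip(r1, r2) if c1 != c2)
    ratio = changed / total
    qa = []
    qa.append(("Has any change occurred?", "yes" if changed else "no"))
    qa.append(("What is the ratio of changed area?",
               "0-10%" if ratio <= 0.10
               else ("10-50%" if ratio <= 0.5 else ">50%")))
    # classes that appear at changed locations in the later image
    new_classes = sorted({c2 for r1, r2 in zip(map_t1, map_t2)
                          for c1, c2 in zip(r1, r2) if c1 != c2})
    qa.append(("What classes appear after the change?",
               ", ".join(new_classes) or "none"))
    return qa
```

Running this on a toy 2x2 pair where one soil pixel becomes a building yields "yes", a "10-50%" change ratio, and "building" as the newly appearing class.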
VSSA-NET: Vertical Spatial Sequence Attention Network for Traffic Sign Detection
Although traffic sign detection has been studied for years and great progress
has been made with the rise of deep learning techniques, many problems remain
to be addressed. Complicated real-world traffic scenes pose two main
challenges. First, traffic signs are usually small objects, which makes them
more difficult to detect than large ones. Second, it is hard to distinguish
false targets that resemble real traffic signs in complex street scenes
without context information. To handle these problems, we
propose a novel end-to-end deep learning method for traffic sign detection in
complex environments. Our contributions are as follows: 1) We propose a
multi-resolution feature fusion network architecture which exploits densely
connected deconvolution layers with skip connections, and can learn more
effective features for small objects; 2) We frame the traffic sign
detection as a spatial sequence classification and regression task, and propose
a vertical spatial sequence attention (VSSA) module to gain more context
information for better detection performance. To comprehensively evaluate the
proposed method, we conduct experiments on several traffic sign datasets as
well as a general object detection dataset, and the results show the
effectiveness of our proposed method.
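A crude, hand-rolled stand-in for the idea behind vertical spatial sequence attention (the real VSSA module is learned end-to-end; this sketch only shows attention applied along each column of an activation map, since context above and below a cell helps disambiguate sign-like false targets):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def vertical_sequence_attention(feature_map):
    """For each column of a 2-D activation map, attend along the
    vertical (row) axis and return one attended value per column.
    Strong activations anywhere in the column dominate its output."""
    n_rows = len(feature_map)
    n_cols = len(feature_map[0])
    out = []
    for c in range(n_cols):
        column = [feature_map[r][c] for r in range(n_rows)]
        weights = softmax(column)  # attention over the vertical sequence
        out.append(sum(w * x for w, x in zip(weights, column)))
    return out

# a column containing a strong activation (e.g. a sign) dominates its output
fmap = [[0.0, 5.0],
        [0.0, 0.0],
        [0.0, 0.0]]
attended = vertical_sequence_attention(fmap)
```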
RSSOD-Bench: A large-scale benchmark dataset for Salient Object Detection in Optical Remote Sensing Imagery
We present the RSSOD-Bench dataset for salient object detection (SOD) in
optical remote sensing imagery. While SOD has achieved success in natural scene
images with deep learning, research in SOD for remote sensing imagery (RSSOD)
is still in its early stages. Existing RSSOD datasets have limitations in terms
of scale and scene categories, which make them misaligned with real-world
applications. To address these shortcomings, we construct the RSSOD-Bench
dataset, which contains images from four different cities in the USA. The
dataset provides annotations for various salient object categories, such as
buildings, lakes, rivers, highways, bridges, aircraft, ships, athletic fields,
and more. The salient objects in RSSOD-Bench exhibit large-scale variations,
cluttered backgrounds, and different seasons. Unlike existing datasets,
RSSOD-Bench offers uniform distribution across scene categories. We benchmark
23 different state-of-the-art approaches from both the computer vision and
remote sensing communities. Experimental results demonstrate that more research
efforts are required for the RSSOD task.
Comment: IGARSS 2023, 4 pages
Few-shot Object Detection in Remote Sensing: Lifting the Curse of Incompletely Annotated Novel Objects
Object detection is an essential and fundamental task in computer vision and
satellite image processing. Existing deep learning methods have achieved
impressive performance thanks to the availability of large-scale annotated
datasets. Yet, in real-world applications the availability of labels is
limited. In this context, few-shot object detection (FSOD) has emerged as a
promising direction, which aims at enabling the model to detect novel objects
with only a few of them annotated. However, many existing FSOD algorithms
overlook a critical issue: when an input image contains multiple novel objects
and only a subset of them are annotated, the unlabeled objects will be
considered as background during training. This can cause confusion and
severely impact the model's ability to recall novel objects. To address this
issue, we propose a self-training-based FSOD (ST-FSOD) approach, which
incorporates the self-training mechanism into the few-shot fine-tuning process.
ST-FSOD aims to enable the discovery of novel objects that are not annotated,
and take them into account during training. On the one hand, we devise a
two-branch region proposal network (RPN) to separate the proposal extraction
of base and novel objects. On the other hand, we incorporate the student-teacher
mechanism into RPN and the region of interest (RoI) head to include those
highly confident yet unlabeled targets as pseudo labels. Experimental results
demonstrate that our proposed method outperforms the state-of-the-art in
various FSOD settings by a large margin. The code will be publicly available
at https://github.com/zhu-xlab/ST-FSOD.
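The confidence-based selection of unlabeled targets described above can be sketched as follows; the thresholds and the three-way split are illustrative assumptions, not the paper's actual values or mechanism:

```python
def select_pseudo_labels(proposals, score_thresh=0.9, ignore_thresh=0.5):
    """Split teacher proposals (box, confidence) into confident pseudo
    labels, ignored regions (excluded from the background loss so that
    unlabeled novel objects are not punished as background), and true
    background. Thresholds here are illustrative only."""
    pseudo, ignored, background = [], [], []
    for box, score in proposals:
        if score >= score_thresh:
            pseudo.append(box)       # treat as a novel-object pseudo label
        elif score >= ignore_thresh:
            ignored.append(box)      # uncertain: exclude from training loss
        else:
            background.append(box)   # safely treat as background
    return pseudo, ignored, background
```

The key point this illustrates is that mid-confidence proposals are neither forced to be objects nor forced to be background, which is how self-training can avoid recalling fewer novel objects.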
HTC-DC Net: Monocular Height Estimation from Single Remote Sensing Images
3D geo-information is of great significance for understanding the living
environment; however, 3D perception from remote sensing data, especially on a
large scale, is restricted. To tackle this problem, we propose a method for
monocular height estimation from optical imagery, which is currently one of the
richest sources of remote sensing data. As an ill-posed problem, monocular
height estimation requires well-designed networks for enhanced representations
to improve performance. Moreover, the distribution of height values is
long-tailed with the low-height pixels, e.g., the background, as the head, and
thus trained networks are usually biased and tend to underestimate building
heights. To solve these problems, instead of formulating the problem as a
regression task, we propose HTC-DC Net following the classification-regression
paradigm, with the head-tail cut (HTC) and the distribution-based constraints
(DCs) as the main contributions. HTC-DC Net is composed of the backbone network
as the feature extractor, the HTC-AdaBins module, and the hybrid regression
process. The HTC-AdaBins module serves as the classification phase to determine
bins adaptive to each input image. It is equipped with a vision transformer
encoder to incorporate local context with holistic information and involves an
HTC to address the long-tailed problem in monocular height estimation for
balancing the performances of foreground and background pixels. The hybrid
regression process does the regression via the smoothing of bins from the
classification phase, which is trained via DCs. The proposed network is tested
on three datasets of different resolutions, namely ISPRS Vaihingen (0.09 m),
DFC19 (1.3 m) and GBH (3 m). Experimental results show the superiority of the
proposed network over existing methods by large margins. Extensive ablation
studies demonstrate the effectiveness of each design component.
Comment: 18 pages, 10 figures, submitted to IEEE Transactions on Geoscience
and Remote Sensing
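The classification-regression paradigm can be illustrated with a minimal AdaBins-style hybrid: bin probabilities from the classification phase are turned into a continuous height as a probability-weighted mean of bin centres. This is a sketch of the general idea, not HTC-DC Net's exact formulation:

```python
def hybrid_height(bin_edges, probs):
    """Classification-regression hybrid: given adaptive height-bin
    edges and a predicted probability per bin, return the continuous
    height as the probability-weighted mean of bin centres."""
    centers = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    if len(centers) != len(probs):
        raise ValueError("need one probability per bin")
    return sum(p * c for p, c in zip(probs, centers))

# two bins: a narrow low-height (background) bin and a wide building bin,
# echoing the long-tailed distribution the abstract describes
height = hybrid_height([0.0, 2.0, 10.0], [0.5, 0.5])
```

Making the bins adaptive per image (and cutting the head of the distribution, as HTC does) lets the model spend resolution where the heights actually are instead of on the dominant low-height background.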
Knowledge Transfer for Label-efficient Monocular Height Estimation
Estimating height from monocular remote sensing images is one of the most
efficient ways to build large-scale 3D city models. However, existing deep
learning based methods usually require a large amount of training data, which
can be costly or even impossible to obtain. Towards a label-efficient deep
learning model, we propose a new task and dataset for weak-shot monocular
height estimation. In this task, only the relative height labels between pairs
of a small portion of points are given, which are cheaper and easier for
humans to annotate. In addition, to enhance model performance under this
sparse and weak-shot supervision, we propose a Transformer-based network for
transferring the knowledge learned from a large-scale synthetic dataset to
real-world data. Experimental results show the effectiveness of the proposed
method on a public dataset under sparse and weak supervision.
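Supervision from relative height labels can be sketched with a pairwise ranking loss; the logistic form below is a common choice and an assumption on our part, not necessarily the loss used in the paper:

```python
import math

def pairwise_ranking_loss(pred_heights, pairs):
    """Ranking loss over sparse relative labels: each pair (i, j)
    asserts that point i is higher than point j. The logistic term
    penalizes predictions that violate (or barely satisfy) the
    ordering, without needing absolute height values."""
    loss = 0.0
    for i, j in pairs:
        diff = pred_heights[i] - pred_heights[j]
        loss += math.log(1.0 + math.exp(-diff))  # small when diff >> 0
    return loss / len(pairs)
```

This shows why pairwise labels are cheap: an annotator only decides which of two points is higher, yet the loss still pushes the network toward a consistent height ordering.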
GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data
Geometric information in the normalized digital surface models (nDSM) is
highly correlated with the semantic class of the land cover. Exploiting two
modalities (RGB and nDSM (height)) jointly has great potential to improve the
segmentation performance. However, it is still an under-explored field in
remote sensing due to the following challenges. First, existing datasets are
relatively small in scale and limited in diversity, which restricts their
value for validation. Second, there is a lack of unified benchmarks for
performance assessment, which makes it difficult to compare the effectiveness
of different models. Last, sophisticated
multi-modal semantic segmentation methods have not been deeply explored for
remote sensing data. To cope with these challenges, in this paper, we introduce
a new remote-sensing benchmark dataset for multi-modal semantic segmentation
based on RGB-Height (RGB-H) data. Towards a fair and comprehensive analysis of
existing methods, the proposed benchmark consists of 1) a large-scale dataset
including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a
comprehensive evaluation and analysis of existing multi-modal fusion strategies
for both convolutional and Transformer-based networks on remote sensing data.
Furthermore, we propose a novel and effective Transformer-based intermediary
multi-modal fusion (TIMF) module to improve the semantic segmentation
performance through adaptive token-level multi-modal fusion. The designed
benchmark can foster future research on developing new methods for multi-modal
learning on remote sensing data. Extensive analyses of those methods are
conducted and valuable insights are provided through the experimental results.
Code for the benchmark and baselines can be accessed at
\url{https://github.com/EarthNets/RSI-MMSegmentation}.
Comment: 13 pages
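Token-level multi-modal fusion in the spirit of TIMF can be sketched as merging co-located RGB and nDSM (height) tokens; the fixed blending weight below stands in for what would be a learned, adaptive gate in the actual module:

```python
def token_fusion(rgb_tokens, height_tokens, alpha=0.5):
    """Token-level fusion of two modality streams: each spatial token
    from the RGB branch is blended with the co-located height token.
    alpha is a fixed stand-in for a learned per-token gate."""
    fused = []
    for r, h in zip(rgb_tokens, height_tokens):
        fused.append([alpha * rv + (1 - alpha) * hv
                      for rv, hv in zip(r, h)])
    return fused
```

Operating at the token level means the two modalities interact at every spatial position inside the Transformer, rather than only once at the input or output of the network.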
Pyrolysis treatment of nonmetal fraction of waste printed circuit boards: Focusing on the fate of bromine
Advanced thermal treatment of electronic waste offers the advantages of volume reduction and energy recovery. In this work, the pyrolysis behaviour of nonmetallic fractions of waste printed circuit boards was studied, and the fate of bromine and the thermal decomposition pathway of these fractions were further probed. The thermogravimetric analysis showed that the temperatures of maximum mass loss were located at 319°C and 361°C, with mass losses of 29.6% and 50.6%, respectively. The Fourier transform infrared spectroscopy analysis revealed that the spectra at temperatures of 300°C–400°C were complicated, with larger absorbance intensity. The nonmetallic fractions of waste printed circuit boards decomposed drastically and more evolved products were detected in the temperature range of 600°C–1000°C. The gas chromatography–mass spectrometry analysis indicated that various brominated derivatives were generated in addition to small molecules such as CH4, H2O and CO. The release intensities of CH4 and H2O increased with increasing temperature and reached their maxima at 600°C–800°C and 400°C–600°C, respectively. More bromoethane (C2H5Br) was formed compared with HBr and methyl bromide (CH3Br). The release intensities of bromopropane (C3H7Br) and bromoacetone (C3H5BrO) were comparable, although smaller than that of bromopropene (C3H5Br). More dibromophenol (C6H4Br2O) than bromophenol (C6H5BrO) was released during the thermal treatment. During the thermal process, part of the ether bonds first ruptured, forming bisphenol A, propyl alcohol and tetrabromobisphenol A. The tetrabromobisphenol A then decomposed into C6H5BrO and HBr, which further reacted with small molecules to form brominated derivatives. These results imply that debromination of the raw nonmetallic fractions or of the pyrolysis products should be applied for their environmentally sound treatment.